Transferability Estimation
How NOT to benchmark your SITE metric: Beyond Static Leaderboards and Towards Realistic Evaluation
Singh, Prabhant, Hess, Sibylle, Vanschoren, Joaquin
Transferability estimation metrics are used to find a high-performing pre-trained model for a given target task without fine-tuning candidate models and without access to the source dataset. Despite growing interest in developing such metrics, the benchmarks used to measure their progress have gone largely unexamined. In this work, we empirically show that the widely used benchmark setups for evaluating transferability estimation metrics are fundamentally flawed: their unrealistic model spaces and static performance hierarchies artificially inflate the perceived performance of existing metrics, to the point where simple, dataset-agnostic heuristics can outperform sophisticated methods. Our analysis reveals a critical disconnect between current evaluation protocols and the complexities of real-world model selection. To address this, we provide concrete recommendations for constructing more robust and realistic benchmarks to guide future research in a more meaningful direction.
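To make the claim about dataset-agnostic heuristics concrete, here is a minimal, fully synthetic sketch (not the paper's benchmark or data) of how a single fixed model ranking can rival a per-task metric when a benchmark's performance hierarchy is static:

```python
# Sketch: on a static benchmark, one latent "model quality" drives performance
# on every task, so a dataset-agnostic heuristic (one fixed ranking reused for
# every target task) competes with a per-task transferability metric.
# All inputs are illustrative placeholders.
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(0)
n_models, n_tasks = 10, 6

model_quality = rng.normal(size=n_models)
finetuned_acc = model_quality[:, None] + 0.1 * rng.normal(size=(n_models, n_tasks))

# A per-task metric (a noisy view of the truth) vs. a fixed global ranking.
metric_scores = finetuned_acc + 0.3 * rng.normal(size=(n_models, n_tasks))
global_rank = finetuned_acc.mean(axis=1)  # dataset-agnostic heuristic

tau_metric = np.mean([kendalltau(metric_scores[:, t], finetuned_acc[:, t])[0]
                      for t in range(n_tasks)])
tau_heuristic = np.mean([kendalltau(global_rank, finetuned_acc[:, t])[0]
                         for t in range(n_tasks)])
print(f"metric tau={tau_metric:.2f}  fixed-ranking tau={tau_heuristic:.2f}")
```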
Estimating Time Series Foundation Model Transferability via In-Context Learning
Yao, Qingren, Jin, Ming, Zhang, Chengqi, Yang, Chao-Han Huck, Qi, Jun, Pan, Shirui
Time series foundation models (TSFMs) offer strong zero-shot forecasting via large-scale pre-training, yet fine-tuning remains critical for boosting performance in domains with limited public data. With the growing number of TSFMs, efficiently identifying the best model for downstream fine-tuning becomes increasingly challenging. Leveraging the natural tabular structure formed by dataset meta-features, model characteristics, and fine-tuned performance, we employ tabular foundation models to serve as in-context learners. We establish a comprehensive benchmark for transferability estimation comprising 10 datasets, 10 foundation models, and 3 forecasting tasks. The resulting estimator's predictions align strongly with actual fine-tuned performance on previously unseen datasets, achieving a mean rank correlation of approximately 0.6 and a 30% improvement over using zero-shot performance as the transferability score.

The emergence of time series foundation models (TSFMs) is reshaping the paradigm of time series forecasting (Liang et al., 2025) through their strong zero-shot capabilities. Although efficient and cost-effective, zero-shot inference often underperforms in out-of-distribution scenarios, particularly in domains with limited public data, such as healthcare (Gupta et al., 2024) and finance (Fu et al., 2024). Fine-tuning helps bridge the gap by transferring generalized knowledge from large-scale pre-training to specific, resource-limited downstream tasks (Li & Zhu, 2025). However, due to the inherent diversity of time series data, no single model consistently outperforms others in all scenarios (Brigato et al., 2025).
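A hedged sketch of the pipeline the abstract describes, with a scikit-learn GradientBoostingRegressor standing in for the tabular foundation model and synthetic data in place of the benchmark; all names and sizes are illustrative assumptions:

```python
# Rows pair dataset meta-features with model characteristics; the target is
# fine-tuned performance. Leave-one-dataset-out evaluation measures how well
# the learned table ranks candidate models on an unseen dataset.
import numpy as np
from scipy.stats import spearmanr
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)
n_datasets, n_models, n_feats = 10, 10, 8

# Synthetic (meta-feature, model-characteristic) table and fine-tuned scores.
X = rng.normal(size=(n_datasets * n_models, n_feats))
y = X @ rng.normal(size=n_feats) + 0.2 * rng.normal(size=len(X))
dataset_id = np.repeat(np.arange(n_datasets), n_models)

taus = []
for d in range(n_datasets):
    train, test = dataset_id != d, dataset_id == d
    reg = GradientBoostingRegressor().fit(X[train], y[train])
    taus.append(spearmanr(reg.predict(X[test]), y[test])[0])
print(f"mean rank correlation over unseen datasets: {np.mean(taus):.2f}")
```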
Analysis of Transferability Estimation Metrics for Surgical Phase Recognition
Singh, Prabhant, Li, Yiping, Khalil, Yasmina Al
Fine-tuning pre-trained models has become a cornerstone of modern machine learning, allowing practitioners to achieve high performance with limited labeled data. In surgical video analysis, where expert annotations are especially time-consuming and costly, identifying the most suitable pre-trained model for a downstream task is both critical and challenging. Source-independent transferability estimation (SITE) offers a solution by predicting how well a model will fine-tune on target data using only its embeddings or outputs, without requiring full retraining. In this work, we formalize SITE for surgical phase recognition and provide the first comprehensive benchmark of three representative metrics, LogME, H-Score, and TransRate, on two diverse datasets (RAMIE and AutoLaparo). Our results show that LogME, particularly when aggregated by the minimum per-subset score, aligns most closely with fine-tuning accuracy; H-Score yields only weak predictive power; and TransRate often inverts true model rankings. Ablation studies show that when candidate models have similar performances, transferability estimates lose discriminative power, emphasizing the importance of maintaining model diversity or using additional validation. We conclude with practical guidelines for model selection and outline future directions toward domain-specific metrics, theoretical foundations, and interactive benchmarking tools.
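Of the three metrics benchmarked here, H-Score (Bao et al., 2019) is the simplest to sketch. The following NumPy version, on synthetic stand-in embeddings, computes H(f) = tr(cov(f)^+ cov(z)), where z replaces each feature vector with its class-conditional mean:

```python
# Minimal H-Score sketch; higher values are taken to indicate better
# transferability. Inputs are synthetic placeholders, not surgical data.
import numpy as np

def h_score(features: np.ndarray, labels: np.ndarray) -> float:
    f = features - features.mean(axis=0)      # center features
    cov_f = f.T @ f / len(f)                  # total covariance
    z = np.zeros_like(f)
    for c in np.unique(labels):               # class-conditional means
        z[labels == c] = f[labels == c].mean(axis=0)
    cov_z = z.T @ z / len(z)                  # between-class covariance
    return float(np.trace(np.linalg.pinv(cov_f) @ cov_z))

rng = np.random.default_rng(2)
emb = rng.normal(size=(200, 32))              # stand-in for model embeddings
lab = rng.integers(0, 7, size=200)            # e.g., 7 surgical phases
print(f"H-Score: {h_score(emb, lab):.3f}")
```

The min-per-subset aggregation mentioned above would then amount to computing such a score separately per video or subset and keeping the minimum.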
Benchmarking Transferability: A Framework for Fair and Robust Evaluation
Kazemi, Alireza, Rezvani, Helia, Baktashmotlagh, Mahsa
Transferability scores aim to quantify how well a model trained on one domain generalizes to a target domain. Despite numerous methods proposed for measuring transferability, their reliability and practical usefulness remain inconclusive, often due to differing experimental setups, datasets, and assumptions. In this paper, we introduce a comprehensive benchmarking framework designed to systematically evaluate transferability scores across diverse settings. Through extensive experiments, we observe variations in how different metrics perform under various scenarios, suggesting that current evaluation practices may not fully capture each method's strengths and limitations. Our findings underscore the value of standardized assessment protocols, paving the way for more reliable transferability measures and better-informed model selection in cross-domain applications. Additionally, our proposed metric achieves a 3.5% improvement in the head-training fine-tuning setup. Our code is available in this repository: https://github.com/alizkzm/
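In the spirit of the standardized protocols argued for here, a minimal sketch (hypothetical metrics and synthetic ground truth, not the paper's framework) that holds the model pool, target datasets, and correlation measure fixed across all metrics:

```python
# Every metric is scored on the same datasets with the same measure;
# weighted Kendall's tau emphasizes getting the top-ranked models right.
import numpy as np
from scipy.stats import weightedtau

rng = np.random.default_rng(3)
n_models, n_datasets = 12, 5
finetuned = rng.uniform(0.5, 0.95, size=(n_datasets, n_models))  # ground truth

# Two hypothetical transferability metrics with different noise levels.
scores = {
    "metric_A": finetuned + 0.02 * rng.normal(size=finetuned.shape),
    "metric_B": finetuned + 0.10 * rng.normal(size=finetuned.shape),
}

for name, s in scores.items():
    taus = [weightedtau(s[d], finetuned[d])[0] for d in range(n_datasets)]
    print(f"{name}: per-dataset tau = {np.round(taus, 2)}, mean = {np.mean(taus):.2f}")
```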
Feature Space Perturbation: A Panacea to Enhanced Transferability Estimation
Khoba, Prafful Kumar, Wang, Zijian, Arora, Chetan, Baktashmotlagh, Mahsa
Most existing metrics primarily focus on identifying the statistical relationship between feature embeddings and the corresponding labels within the target dataset, but overlook the crucial aspect of model robustness. This oversight may limit their effectiveness in accurately ranking pre-trained models. To address this limitation, we introduce a feature perturbation method that enhances the transferability estimation process by systematically altering the feature space. Our method includes a Spread operation that increases intra-class variability, adding complexity within classes, and an Attract operation that minimizes the distances between different classes, thereby blurring the class boundaries. Through extensive experimentation, we demonstrate the efficacy of our feature perturbation method in providing a more precise and robust estimation of model transferability. Notably, the existing LogME method exhibited a significant improvement, showing a 28.84% increase in performance after applying our feature perturbation method. The implementation is available at https://github.
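The paper's exact formulation is not reproduced here; below is one plausible NumPy reading of the two operations, under the assumption that Spread scales each sample's offset from its class mean and Attract shifts class means toward the global mean:

```python
# Sketch of the Spread/Attract perturbations described above. A metric such
# as LogME would then be recomputed on the perturbed features.
import numpy as np

def spread(f, y, alpha=1.5):
    """Scale each sample's offset from its class mean by alpha > 1."""
    out = f.copy()
    for c in np.unique(y):
        mu = f[y == c].mean(axis=0)
        out[y == c] = mu + alpha * (f[y == c] - mu)
    return out

def attract(f, y, beta=0.5):
    """Shift each class mean a fraction beta of the way to the global mean."""
    g = f.mean(axis=0)
    out = f.copy()
    for c in np.unique(y):
        mu = f[y == c].mean(axis=0)
        out[y == c] += beta * (g - mu)
    return out

rng = np.random.default_rng(4)
feats = rng.normal(size=(300, 64))        # stand-in embeddings
labels = rng.integers(0, 10, size=300)
perturbed = attract(spread(feats, labels), labels)
```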
Occam's model: Selecting simpler representations for better transferability estimation
Singh, Prabhant, Hess, Sibylle, Vanschoren, Joaquin
Fine-tuning models that have been pre-trained on large datasets has become a cornerstone of modern machine learning workflows. With the widespread availability of online model repositories, such as Hugging Face, it is now easier than ever to fine-tune pre-trained models for specific tasks. This raises a critical question: which pre-trained model is most suitable for a given task? This problem is called transferability estimation. In this work, we introduce two novel and effective metrics for estimating the transferability of pre-trained models. Our approach is grounded in viewing transferability as a measure of how easily a pre-trained model's representations can be trained to separate target classes, providing a unique perspective on transferability estimation. We rigorously evaluate the proposed metrics against state-of-the-art alternatives across diverse problem settings, demonstrating their robustness and practical utility. Additionally, we present theoretical insights that explain our metrics' efficacy and adaptability to various scenarios. We experimentally show that our metrics increase Kendall's Tau by up to 32% compared to the state-of-the-art baselines.
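The paper's own metrics are not spelled out in this abstract; as a generic baseline that captures the same view of transferability, one can score each candidate model by how well a simple linear probe separates target classes in its frozen embeddings:

```python
# Separability-as-transferability baseline (illustrative, not the paper's
# metrics): cross-validated linear-probe accuracy on frozen embeddings.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def separability_score(embeddings: np.ndarray, labels: np.ndarray) -> float:
    probe = LogisticRegression(max_iter=1000)
    return float(cross_val_score(probe, embeddings, labels, cv=3).mean())

rng = np.random.default_rng(5)
# Hypothetical embeddings of the same target data from three candidate models.
candidates = {f"model_{i}": rng.normal(size=(240, 128)) for i in range(3)}
labels = rng.integers(0, 4, size=240)
best = max(candidates, key=lambda m: separability_score(candidates[m], labels))
print(f"selected: {best}")
```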
Selection, Ensemble, and Adaptation: Advancing Multi-Source-Free Domain Adaptation via Architecture Zoo
Pei, Jiangbo, Li, Ruizhe, Men, Aidong, Liu, Yang, Zhuang, Xiahai, Chen, Qingchao
Conventional Multi-Source Free Domain Adaptation (MSFDA) assumes that each source domain provides a single source model, and that all source models adopt a uniform architecture. This paper introduces Zoo-MSFDA, a more general setting that allows each source domain to offer a zoo of multiple source models with different architectures. While it enriches the source knowledge, Zoo-MSFDA risks being dominated by suboptimal or harmful models. To address this issue, we theoretically analyze the model selection problem in Zoo-MSFDA and introduce two principles: the transferability principle and the diversity principle. Recognizing the challenge of measuring transferability, we subsequently propose a novel Source-Free Unsupervised Transferability Estimation (SUTE). It enables assessing and comparing transferability across multiple source models with different architectures under domain shift, without requiring target labels or source data. Based on the above, we introduce a Selection, Ensemble, and Adaptation (SEA) framework to address Zoo-MSFDA, which consists of: 1) source model selection based on the proposed principles and SUTE; 2) ensemble construction based on SUTE-estimated transferability; and 3) target-domain adaptation of the ensemble model. Evaluations demonstrate that our SEA framework, with the introduced Zoo-MSFDA setting, significantly improves adaptation performance (e.g., by 13.5% on DomainNet). Additionally, our SUTE achieves state-of-the-art performance in transferability estimation.
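SUTE's formulation is not given in the abstract; as a generic source-free, label-free stand-in, the sketch below ranks source models by the mean entropy of their softmax predictions on unlabeled target data (more confident predictions taken as a proxy for higher transferability):

```python
# Entropy-based ranking of a model zoo on unlabeled target data; a simple
# stand-in for source-free unsupervised transferability estimation, not SUTE.
import numpy as np

def mean_prediction_entropy(logits: np.ndarray) -> float:
    z = logits - logits.max(axis=1, keepdims=True)   # numerically stable softmax
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return float(-(p * np.log(p + 1e-12)).sum(axis=1).mean())

rng = np.random.default_rng(6)
# Hypothetical target-domain logits from three source models in the zoo.
zoo = {f"source_{i}": rng.normal(scale=s, size=(500, 10))
       for i, s in enumerate([0.5, 1.0, 2.0])}
ranking = sorted(zoo, key=lambda m: mean_prediction_entropy(zoo[m]))
print("most -> least transferable (by this proxy):", ranking)
```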
KITE: A Kernel-based Improved Transferability Estimation Method
Transferability estimation has emerged as an important problem in transfer learning. A transferability estimation method takes as input a set of pre-trained models and decides which pre-trained model can deliver the best transfer learning performance. Existing methods tackle this problem by analyzing the output of the pre-trained model or by comparing the pre-trained model with a probe model trained on the target dataset. However, neither is sufficient to provide reliable and efficient transferability estimations. In this paper, we present a novel perspective and introduce Kite, a Kernel-based Improved Transferability Estimation method. Kite is based on the key observations that the separability of the pre-trained features and the similarity of the pre-trained features to random features are two important factors for estimating transferability. Inspired by kernel methods, Kite adopts centered kernel alignment as an effective way to assess feature separability and feature similarity. Kite is easy to interpret, fast to compute, and robust to the target dataset size. We evaluate the performance of Kite on a recently introduced large-scale model selection benchmark. The benchmark contains 8 source datasets, 6 target datasets, and 4 architectures, for a total of 32 pre-trained models. Extensive results show that Kite outperforms existing methods by a large margin for transferability estimation.
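A minimal sketch of Kite's two ingredients using linear centered kernel alignment (the paper's estimator may combine them differently): separability is proxied by CKA between features and one-hot labels, and similarity to random features by CKA between the features and a random matrix standing in for an untrained network:

```python
# Linear CKA between centered representations; all inputs are synthetic
# placeholders. High CKA(f, labels) suggests separable features; high
# CKA(f, random) suggests the representation stayed close to a random one.
import numpy as np

def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    num = np.linalg.norm(y.T @ x, "fro") ** 2
    den = np.linalg.norm(x.T @ x, "fro") * np.linalg.norm(y.T @ y, "fro")
    return float(num / den)

rng = np.random.default_rng(7)
feats = rng.normal(size=(300, 64))                 # pre-trained features
labels = rng.integers(0, 5, size=300)
one_hot = np.eye(5)[labels]
random_feats = rng.normal(size=(300, 64))          # untrained-network stand-in

print(f"separability  CKA(f, y): {linear_cka(feats, one_hot):.3f}")
print(f"randomness    CKA(f, r): {linear_cka(feats, random_feats):.3f}")
```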